Traffic data have long suffered from missing values and corruption, which degrade the accuracy and utility of downstream intelligent transportation system (ITS) applications. Noting the inherent low-rank property of traffic data, a large body of research formulates missing traffic data recovery as a low-rank tensor completion (LRTC) problem. Due to the non-convexity and discreteness of rank minimization in LRTC, existing methods either replace the rank function with a convex surrogate or approximate it with a non-convex surrogate involving many parameters. In this study, we propose a parameter-free non-convex tensor completion model (TC-PFNC) for traffic data recovery, in which a log-based relaxation term is designed to approximate the tensor algebraic rank. Moreover, previous studies usually assume that the observations are reliable and contain no outliers. We therefore extend TC-PFNC to a robust version (RTC-PFNC) by modeling potential outliers in the traffic data, which can recover missing values from partial and corrupted observations and remove anomalies from the observations. Numerical solutions of TC-PFNC and RTC-PFNC are elaborated based on the alternating direction method of multipliers (ADMM). Extensive experimental results on four real-world traffic datasets demonstrate that the proposed methods outperform other state-of-the-art methods in both missing and corrupted data recovery. The code used in this paper is available at: https://github.com/younghe49/t-ITSPFNC.
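To make the idea of a parameter-light, non-convex rank surrogate concrete, here is a minimal sketch that scores a tensor unfolding with a log-based relaxation of its singular values. It is an illustration only, assuming the relaxation acts on singular values of a mode unfolding; the function name, unfolding mode, and `eps` constant are hypothetical and not taken from TC-PFNC.

```python
import numpy as np

def log_rank_surrogate(tensor, mode=0, eps=1e-3):
    """Hypothetical log-based relaxation of the rank of a mode-`mode` unfolding.

    Instead of counting nonzero singular values (the exact rank) or summing
    them (the convex nuclear norm), each singular value s contributes
    log(1 + s / eps), which grows slowly with s -- a common style of
    non-convex rank surrogate.
    """
    # Unfold the tensor along the chosen mode into a matrix.
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    s = np.linalg.svd(unfolded, compute_uv=False)
    return float(np.sum(np.log1p(s / eps)))

# Toy usage on a random 3-way tensor.
rng = np.random.default_rng(0)
T = rng.standard_normal((10, 8, 6))
print(log_rank_surrogate(T, mode=0))
```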
Although previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of synthesized videos, they pay less attention to lip motion jitter, which greatly undermines the realism of talking face videos. What causes motion jitter, and how can the problem be mitigated? In this paper, we conduct a systematic analysis of the motion jitter problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and the output video, and we improve motion stability through a series of effective designs. We find that several issues can cause jitter in synthesized talking face videos: 1) jitter in the input 3D face representations; 2) training-inference mismatch; 3) a lack of dependency modeling between video frames. Accordingly, we propose three effective solutions: 1) a Gaussian-based adaptive smoothing module to smooth the 3D face representations and eliminate jitter in the input; 2) augmented erosions added to the input data of the neural renderer during training to simulate the distortions seen at inference and reduce the mismatch; 3) an audio-fused transformer generator to model dependencies between video frames. In addition, considering that there is no off-the-shelf metric for measuring motion jitter in talking face videos, we devise an objective metric (Motion Stability Index, MSI) that quantifies motion jitter by computing the reciprocal of variance acceleration. Extensive experimental results show the superiority of our method in generating motion-stable talking face videos, with better quality than previous systems.
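The abstract defines MSI only as the reciprocal of variance acceleration. A minimal sketch of one plausible reading is given below, assuming "acceleration" means second-order temporal differences of tracked landmark positions; the function name and the landmark-based input are assumptions rather than the paper's exact definition.

```python
import numpy as np

def motion_stability_index(landmarks):
    """Hypothetical MSI: inverse of the variance of frame-to-frame acceleration.

    `landmarks` has shape (num_frames, num_points, 2): per-frame 2D landmark
    positions. Acceleration is approximated by second-order temporal
    differences; jittery motion yields high acceleration variance and hence
    a low MSI.
    """
    accel = np.diff(landmarks, n=2, axis=0)     # (T-2, N, 2)
    var = np.var(accel, axis=0).mean()          # average variance over points/axes
    return 1.0 / (var + 1e-8)                   # epsilon avoids division by zero

# Toy usage: smooth motion scores higher than noisy motion.
t = np.linspace(0, 1, 100)[:, None, None]
smooth = np.concatenate([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=2)
noisy = smooth + 0.05 * np.random.default_rng(0).standard_normal(smooth.shape)
print(motion_stability_index(smooth), motion_stability_index(noisy))
```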
This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2022. The GENEA Challenge provides processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text, and position and rotation matrix features for gestures. Each modality is projected into two distinct subspaces: modality-invariant and modality-specific. To learn the commonalities shared across modalities and capture the characteristics of modality-specific representations, a gradient-reversal-layer-based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder generates appropriate gestures using all the representations together with rhythm-related features from the audio. Our code, pre-trained models, and demo are available at https://github.com/youngseng/represture.
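The gradient reversal layer mentioned above is a standard adversarial-training component. Below is a generic PyTorch sketch of such a layer, not the authors' code; the `lambda_` scaling factor and the toy usage are illustrative assumptions.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward.

    Placing this between a shared encoder and a modality classifier trains the
    encoder adversarially: it learns features the classifier cannot use to
    tell modalities apart, i.e. modality-invariant features.
    """

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the shared encoder.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    return GradReverse.apply(x, lambda_)


# Toy usage: gradients w.r.t. `feat` are flipped in sign.
feat = torch.randn(4, 16, requires_grad=True)
out = grad_reverse(feat, lambda_=0.5).sum()
out.backward()
print(feat.grad[0, :4])  # every entry equals -0.5
```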
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state-of-the-art encoder-decoder model with three techniques. First, we use a two-phase pre-training process to improve the model's performance on low-resource summarization tasks. The model is first pre-trained on text corpora for language understanding, and then continually pre-trained on summarization corpora for grounded text generation. Second, we replace the self-attention layers in the encoder with disentangled attention layers, where each word is represented by two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates a new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient: it outperforms the 600x larger PaLM-540B on XSum and the 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms competing models.
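The disentangled attention described here (separate content and position vectors) follows the general formulation popularized by DeBERTa. The single-head sketch below assumes a similar content-to-content, content-to-position, and position-to-content decomposition; all module names, sizes, and the relative-position clamping are illustrative, not Z-Code++'s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledAttention(nn.Module):
    """Single-head attention where content and relative position are
    represented by separate vectors and mixed via three score terms:
    content-to-content, content-to-position, and position-to-content."""

    def __init__(self, dim=64, max_rel=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Separate projections for relative-position embeddings.
        self.pos_emb = nn.Embedding(2 * max_rel + 1, dim)
        self.pos_q = nn.Linear(dim, dim)
        self.pos_k = nn.Linear(dim, dim)
        self.max_rel = max_rel
        self.scale = dim ** -0.5

    def forward(self, h):                      # h: (batch, seq, dim)
        b, n, d = h.shape
        q, k, v = self.q(h), self.k(h), self.v(h)

        idx = torch.arange(n, device=h.device)
        rel = (idx[:, None] - idx[None, :]).clamp(-self.max_rel, self.max_rel) + self.max_rel
        p = self.pos_emb(rel)                                    # (seq, seq, dim)

        c2c = q @ k.transpose(-1, -2)                            # content-to-content
        c2p = torch.einsum('bid,ijd->bij', q, self.pos_k(p))     # content-to-position
        p2c = torch.einsum('ijd,bjd->bij', self.pos_q(p), k)     # position-to-content

        attn = F.softmax((c2c + c2p + p2c) * self.scale, dim=-1)
        return attn @ v


# Toy usage.
layer = DisentangledAttention(dim=64)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```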
Multi-label learning on image data has been widely exploited with deep learning models. However, supervised training of deep CNN models often fails to discover sufficiently discriminative features for classification. As a result, many self-supervised methods have been proposed to learn more robust image representations. However, most self-supervised methods focus on single-label data and fall short on more complex images containing multiple objects. We therefore propose an Object-Aware Self-Supervision (OASS) method to obtain finer-grained representations for multi-label learning, dynamically generating auxiliary tasks based on object locations. Second, the robust representations learned by OASS can be leveraged to efficiently generate Class-Specific Instances (CSI) in a proposal-free fashion, which better guides the transfer of multi-label supervision signals to instances. Extensive experiments on the VOC2012 dataset for multi-label classification demonstrate the effectiveness of the proposed method against state-of-the-art counterparts.
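The abstract does not spell out how auxiliary tasks are generated from object locations. One plausible reading, sketched below purely as an assumption, is to build object-centered crops that a standard self-supervised objective could then consume; the helper name, crop size, and the use of `torchvision` are hypothetical.

```python
import torch
import torchvision.transforms.functional as TF

def object_aware_views(image, boxes, out_size=96):
    """Hypothetical auxiliary-task input builder: one crop per located object.

    `image` is a (C, H, W) tensor and `boxes` a list of (x1, y1, x2, y2)
    object locations (e.g. from a weak localizer). Each object-centered crop
    could be paired with an augmented version of itself for a standard
    self-supervised (e.g. contrastive) loss, so the auxiliary task is driven
    by individual objects rather than the whole multi-object image.
    """
    views = []
    for x1, y1, x2, y2 in boxes:
        crop = TF.resized_crop(image, top=y1, left=x1,
                               height=y2 - y1, width=x2 - x1,
                               size=[out_size, out_size])
        views.append(crop)
    return torch.stack(views)          # (num_objects, C, out_size, out_size)

# Toy usage with two fake object boxes.
img = torch.rand(3, 224, 224)
print(object_aware_views(img, [(10, 20, 110, 140), (100, 50, 200, 180)]).shape)
```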
We propose debiased machine learning estimators for common parameters, such as the local average treatment effect, with high-dimensional covariates. To do so, we characterize the doubly robust moment functions for the entire class of common parameters as combinations of Wald and $\kappa$-weight formulations. We estimate the $\kappa$ weights directly rather than their components, eliminating the numerically unstable step of inverting propensity scores of high-dimensional covariates. We prove that our estimator is balanced, consistent, asymptotically normal, and semiparametrically efficient, and we use it to estimate the effect of 401(k) participation on the distribution of net financial assets.
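The paper's contribution is direct estimation of the $\kappa$ weights; the sketch below does not implement that. It only illustrates the simpler cross-fitted Wald-formula LATE that the $\kappa$-weight machinery generalizes, using sklearn learners chosen for illustration under standard instrument assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fitted_late(X, Z, D, Y, n_splits=5, seed=0):
    """Cross-fitted Wald-formula LATE with ML nuisance estimates.

    LATE = E[E(Y|Z=1,X) - E(Y|Z=0,X)] / E[E(D|Z=1,X) - E(D|Z=0,X)],
    where Z is a binary instrument, D the binary treatment, Y the outcome.
    Nuisance regressions are fit on held-out folds (cross-fitting) to reduce
    own-observation bias; no propensity score is inverted here.
    """
    num = np.zeros(len(Y))
    den = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        for z in (0, 1):
            mask = train[Z[train] == z]
            mu_y = GradientBoostingRegressor().fit(X[mask], Y[mask])
            mu_d = GradientBoostingRegressor().fit(X[mask], D[mask])
            sign = 1 if z == 1 else -1
            num[test] += sign * mu_y.predict(X[test])
            den[test] += sign * mu_d.predict(X[test])
    return num.mean() / den.mean()

# Toy usage with synthetic data (true effect is 2.0).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
Z = rng.integers(0, 2, 2000)
D = ((Z + X[:, 0] + rng.standard_normal(2000)) > 0.5).astype(float)
Y = 2.0 * D + X[:, 0] + rng.standard_normal(2000)
print(cross_fitted_late(X, Z, D, Y))
```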
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS-token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, namely, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
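As a rough illustration of token-relation distillation, the sketch below trains the student to match the teacher's token-to-token similarity matrices with a KL loss. The use of normalized dot-product relations and a temperature is an assumption about the general idea, not necessarily TinyMIM's exact recipe.

```python
import torch
import torch.nn.functional as F

def token_relation_loss(student_tokens, teacher_tokens, tau=1.0):
    """Distill token-to-token relations instead of individual features.

    Each (batch, seq, dim) token map is turned into a (batch, seq, seq)
    relation matrix via normalized dot products and softmax; the student's
    relations are pushed toward the teacher's with a KL divergence. Matching
    relations sidesteps the dimension mismatch between student and teacher
    features.
    """
    def relations(tokens):
        tokens = F.normalize(tokens, dim=-1)
        return tokens @ tokens.transpose(-1, -2) / tau

    s_log_rel = F.log_softmax(relations(student_tokens), dim=-1)
    t_rel = F.softmax(relations(teacher_tokens), dim=-1)
    return F.kl_div(s_log_rel, t_rel, reduction='batchmean')

# Toy usage: student and teacher may have different embedding dims.
student = torch.randn(2, 197, 192)   # e.g. ViT-Tiny tokens
teacher = torch.randn(2, 197, 768)   # e.g. ViT-Base tokens
print(token_relation_loss(student, teacher))
```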
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
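A minimal sketch of the general mechanism described above: a style code predicts a per-channel scale and shift that modulate a transformer feed-forward block. The module name, sizes, and the exact modulation form are assumptions for illustration, not StyleTalk's implementation.

```python
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    """Feed-forward block whose behavior is adjusted by a style code.

    A small linear head maps the style code to a per-channel scale and shift
    that modulate the hidden activations, so the same decoder can produce
    different speaking styles from the same speech content.
    """

    def __init__(self, dim=256, hidden=1024, style_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.to_scale_shift = nn.Linear(style_dim, 2 * hidden)

    def forward(self, x, style):               # x: (B, T, dim), style: (B, style_dim)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        h = torch.relu(self.fc1(x))
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)   # style modulation
        return self.fc2(h)

# Toy usage.
ffn = StyleAdaptiveFFN()
out = ffn(torch.randn(2, 50, 256), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 50, 256])
```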
Decompilation aims to transform a low-level program language (LPL) (e.g., a binary file) into its functionally-equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
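As a rough illustration of the GNN encoder role described above, the sketch below performs one message-passing round over an instruction graph. It is not NeurDP's architecture; the layer design, feature sizes, and adjacency construction are assumptions.

```python
import torch
import torch.nn as nn

class GraphEncoderLayer(nn.Module):
    """One round of message passing over an instruction graph.

    Nodes are low-level instructions (embedded as vectors) and `adj` encodes
    edges such as control flow or data dependencies. Each node aggregates its
    neighbors' messages and updates its state; stacking such layers yields
    node representations that a decoder could map to IR statements.
    """

    def __init__(self, dim=128):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adj):        # node_feats: (N, dim), adj: (N, N)
        # Mean-aggregate messages from neighbors (row-normalized adjacency).
        norm = adj / adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        msgs = norm @ self.message(node_feats)
        return self.update(msgs, node_feats)

# Toy usage: 6 instructions connected in a simple dependency chain.
feats = torch.randn(6, 128)
adj = torch.zeros(6, 6)
adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 4] = adj[4, 5] = 1.0
print(GraphEncoderLayer()(feats, adj).shape)   # torch.Size([6, 128])
```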